Data for the comparison

This experiment compares the performance of Amazon Forecast's automated solutions against classical statistical models from StatsForecast on the M5 and M4 datasets.

This notebook describes the data used in the experiment.

import pandas as pd

from statsforecast import StatsForecast

M5 dataset

Target data

train_df = pd.read_parquet('s3://m5-benchmarks/data/train/target.parquet')
train_df.head()
item_id timestamp demand
0 FOODS_1_001_CA_1 2011-01-29 3.0
1 FOODS_1_001_CA_1 2011-01-30 0.0
2 FOODS_1_001_CA_1 2011-01-31 0.0
3 FOODS_1_001_CA_1 2011-02-01 1.0
4 FOODS_1_001_CA_1 2011-02-02 4.0
# Rename to the column names StatsForecast uses by default: unique_id, ds, y
train_df = train_df.rename(columns={'item_id': 'unique_id',
                                    'timestamp': 'ds',
                                    'demand': 'y'})
StatsForecast.plot(train_df)
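
The rename above can be paired with a few basic sanity checks before plotting or fitting. The following is a minimal sketch, not part of the original experiment: `to_long_format` is a hypothetical helper, and the toy frame only reproduces the five rows shown in the `head()` output above rather than reading the full file from S3.

```python
import pandas as pd

# Toy frame copied from the head() output above (not the full S3 dataset).
raw = pd.DataFrame({
    "item_id": ["FOODS_1_001_CA_1"] * 5,
    "timestamp": ["2011-01-29", "2011-01-30", "2011-01-31",
                  "2011-02-01", "2011-02-02"],
    "demand": [3.0, 0.0, 0.0, 1.0, 4.0],
})


def to_long_format(df, id_col, time_col, target_col):
    """Hypothetical helper: rename to unique_id/ds/y and run basic checks."""
    out = df.rename(columns={id_col: "unique_id",
                             time_col: "ds",
                             target_col: "y"})
    # Parse timestamps so that sorting and plotting use real dates.
    out["ds"] = pd.to_datetime(out["ds"])
    # Each series should have at most one observation per timestamp.
    if out.duplicated(["unique_id", "ds"]).any():
        raise ValueError("duplicate (unique_id, ds) pairs found")
    return out.sort_values(["unique_id", "ds"]).reset_index(drop=True)


long_df = to_long_format(raw, "item_id", "timestamp", "demand")
print(long_df.dtypes)
```

The same helper could be applied to the M4 data by passing `target_value` as the target column, assuming the file has the schema shown in its `head()` output.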

Static variables

static_df = pd.read_parquet('s3://m5-benchmarks/data/train/static.parquet')
static_df.head()
item_id sku_id dept_id cat_id store_id state_id
0 FOODS_1_001_CA_1 FOODS_1_001 FOODS_1 FOODS CA_1 CA
1 FOODS_1_001_CA_2 FOODS_1_001 FOODS_1 FOODS CA_2 CA
2 FOODS_1_001_CA_3 FOODS_1_001 FOODS_1 FOODS CA_3 CA
3 FOODS_1_001_CA_4 FOODS_1_001 FOODS_1 FOODS CA_4 CA
4 FOODS_1_001_TX_1 FOODS_1_001 FOODS_1 FOODS TX_1 TX
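
Because the static table has one row per `item_id`, it can be attached to the target series with a many-to-one join. The sketch below is illustrative only: it uses toy rows copied from the outputs above, and it assumes the target frame has already been renamed to `unique_id`/`ds`/`y` while the static frame still uses `item_id`.

```python
import pandas as pd

# Toy target rows in the renamed (unique_id, ds, y) layout.
target = pd.DataFrame({
    "unique_id": ["FOODS_1_001_CA_1", "FOODS_1_001_CA_1"],
    "ds": pd.to_datetime(["2011-01-29", "2011-01-30"]),
    "y": [3.0, 0.0],
})

# Toy static rows copied from the head() output above.
static = pd.DataFrame({
    "item_id": ["FOODS_1_001_CA_1", "FOODS_1_001_CA_2"],
    "sku_id": ["FOODS_1_001", "FOODS_1_001"],
    "dept_id": ["FOODS_1", "FOODS_1"],
    "cat_id": ["FOODS", "FOODS"],
    "store_id": ["CA_1", "CA_2"],
    "state_id": ["CA", "CA"],
})

# validate="many_to_one" raises if item_id is not unique in the static table.
merged = target.merge(
    static,
    left_on="unique_id",
    right_on="item_id",
    how="left",
    validate="many_to_one",
).drop(columns="item_id")

print(merged)
```

A left join keeps every target row even when a series has no static record, so missing attributes show up as `NaN` rather than silently dropping observations.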

Temporal variables

temporal_df = pd.read_parquet('s3://m5-benchmarks/data/train/temporal.parquet')
temporal_df.head()
item_id timestamp snap_CA snap_TX snap_WI sell_price
0 FOODS_1_001_CA_1 2011-01-29 0.0 0.0 0.0 2.0
1 FOODS_1_001_CA_1 2011-01-30 0.0 0.0 0.0 2.0
2 FOODS_1_001_CA_1 2011-01-31 0.0 0.0 0.0 2.0
3 FOODS_1_001_CA_1 2011-02-01 1.0 1.0 0.0 2.0
4 FOODS_1_001_CA_1 2011-02-02 1.0 0.0 1.0 2.0
temporal_df = temporal_df.rename(columns={'item_id': 'unique_id', 
                                          'timestamp': 'ds'})
StatsForecast.plot(train_df, temporal_df)
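
Since the temporal variables share the `unique_id` and `ds` keys with the target after renaming, the two frames can be aligned row by row. The following is a minimal sketch on toy rows copied from the outputs above (only two of the temporal columns are kept for brevity); it is one way to check alignment, not the procedure used in the experiment.

```python
import pandas as pd

dates = pd.to_datetime(["2011-01-29", "2011-01-30", "2011-01-31"])

# Toy target rows in the renamed layout.
target = pd.DataFrame({
    "unique_id": ["FOODS_1_001_CA_1"] * 3,
    "ds": dates,
    "y": [3.0, 0.0, 0.0],
})

# Toy temporal rows, restricted to two of the columns shown above.
temporal = pd.DataFrame({
    "unique_id": ["FOODS_1_001_CA_1"] * 3,
    "ds": dates,
    "snap_CA": [0.0, 0.0, 0.0],
    "sell_price": [2.0, 2.0, 2.0],
})

# One row per (unique_id, ds) is expected on both sides.
combined = target.merge(
    temporal,
    on=["unique_id", "ds"],
    how="left",
    validate="one_to_one",
)

# Count target rows that did not find a matching temporal record.
missing = combined[["snap_CA", "sell_price"]].isna().sum()
print(missing)
```

Non-zero counts in `missing` would indicate timestamps in the target that have no temporal covariates.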

M4 Daily dataset

train_df = pd.read_parquet('s3://m4-benchmarks/data/train/target.parquet')
train_df.head()
item_id timestamp target_value
0 D1 2019-03-18 1017.1
1 D1 2019-03-19 1019.3
2 D1 2019-03-20 1017.0
3 D1 2019-03-21 1019.2
4 D1 2019-03-22 1018.7
train_df = train_df.rename(columns={'item_id': 'unique_id', 
                                    'timestamp': 'ds',
                                    'target_value': 'y'})
StatsForecast.plot(train_df)
